Introduction

The main objective of this project is to compare different supervised learning methods and see how each one improves the performance of a sentiment classification model.
I chose a dataset of X (Twitter) posts from Kaggle to capture modern, informal vocabulary.

This project includes:

  • Analyzing and visualizing significant information from the X posts.
  • Preprocessing the text data.
  • Creating an AI model that predicts the sentiment of a given sentence.
  • Comparing and reflecting on each model variant.

Chapters:
  1. Dataset overview
  2. Processing text data
     1) Removing duplicates
     2) Balancing the data
     3) Length of sentences
  3. Creating the Model
     1) Class weighting
     2) Text tokenizing
     3) Training the model
     4) Hyperparameter tuning
     5) Testing in practice
  4. Conclusion

I - Dataset overview

Here we check the character of the dataset and look for any significant clues that could be useful later on.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv('tweets.csv', index_col=0)
df
Datetime Tweet Id Text Username sentiment sentiment_score emotion emotion_score
0 2022-09-30 23:29:15+00:00 1575991191170342912 @Logitech @apple @Google @Microsoft @Dell @Len... ManjuSreedaran neutral 0.853283 anticipation 0.587121
1 2022-09-30 21:46:35+00:00 1575965354425131008 @MK_habit_addict @official_stier @MortalKombat... MiKeMcDnet neutral 0.519470 joy 0.886913
2 2022-09-30 21:18:02+00:00 1575958171423752203 As @CRN celebrates its 40th anniversary, Bob F... jfollett positive 0.763791 joy 0.960347
3 2022-09-30 20:05:24+00:00 1575939891485032450 @dell your customer service is horrible especi... daveccarr negative 0.954023 anger 0.983203
4 2022-09-30 20:03:17+00:00 1575939359160750080 @zacokalo @Dell @DellCares @Dell give the man ... heycamella neutral 0.529170 anger 0.776124
... ... ... ... ... ... ... ... ...
24965 2022-01-01 02:02:04+00:00 1477097760931336198 @ElDarkAngel2 @GamersNexus @Dell I wouldn't ev... Eodart negative 0.682981 anger 0.906309
24966 2022-01-01 01:57:34+00:00 1477096631300415496 @kite_real @GamersNexus @Dell I didn't really ... Eodart positive 0.743940 joy 0.951701
24967 2022-01-01 01:36:36+00:00 1477091355629432833 Hey @JoshTheFixer here it is....27 4K UHD USB-... Corleone250 neutral 0.654463 anticipation 0.471185
24968 2022-01-01 01:31:30+00:00 1477090070830141442 @bravadogaming @thewolfpena @Alienware @intel ... MrTwistyyy neutral 0.794049 anticipation 0.747014
24969 2022-01-01 00:59:37+00:00 1477082048900726784 @rabia_ejaz @Dell Stopped buying windows lapto... IDevourNehari positive 0.733861 joy 0.958346

24970 rows × 8 columns

Since we want to predict the sentiment from the text alone,
I decided to drop all unnecessary columns, leaving only 'text' and 'sentiment'.

df.drop(['Datetime', 'Tweet Id', 'Username', 'sentiment_score', 'emotion', 'emotion_score'], axis=1, inplace=True)
df.rename(columns={'Text': 'text', 'sentiment': 'label'}, inplace=True)

# mapping class names to integer labels
labels = {
    'negative': 0,
    'neutral': 1,
    'positive': 2,
}

df['label'] = df['label'].map(labels)
df
text label
0 @Logitech @apple @Google @Microsoft @Dell @Len... 1
1 @MK_habit_addict @official_stier @MortalKombat... 1
2 As @CRN celebrates its 40th anniversary, Bob F... 2
3 @dell your customer service is horrible especi... 0
4 @zacokalo @Dell @DellCares @Dell give the man ... 1
... ... ...
24965 @ElDarkAngel2 @GamersNexus @Dell I wouldn't ev... 0
24966 @kite_real @GamersNexus @Dell I didn't really ... 2
24967 Hey @JoshTheFixer here it is....27 4K UHD USB-... 1
24968 @bravadogaming @thewolfpena @Alienware @intel ... 1
24969 @rabia_ejaz @Dell Stopped buying windows lapto... 2

24970 rows × 2 columns

II - Processing text data

Step 1 - Removing duplicates

The main problems with duplicates:
1) Duplicated sentences can bias the model toward the over-represented label.
2) The same sentence may appear with two different labels,
and then we cannot really tell which one is correct.
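As a quick illustration of the second problem (on a toy frame, not the real dataset), conflicting labels can be found by grouping on the text column:

```python
import pandas as pd

# Toy frame with a repeated sentence and a text that carries two different labels
toy = pd.DataFrame({
    'text': ['great laptop', 'great laptop', 'battery died', 'battery died'],
    'label': [2, 2, 0, 1],
})

# Exact duplicates (same text AND label) inflate one class
exact_dupes = toy.duplicated().sum()

# Texts that appear with more than one label are ambiguous
conflicts = toy.groupby('text')['label'].nunique()
ambiguous = conflicts[conflicts > 1].index.tolist()

print(exact_dupes)   # 1
print(ambiguous)     # ['battery died']
```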

df.isna().sum()
text     0
label    0
dtype: int64
df.duplicated().sum()
331
from copy import deepcopy

sns.set(style="whitegrid")
NAVY_LIGHT = '#4B527E'
NAVY_DARK = '#7C81AD'

# For better plot visualisation, keep a copy that still contains the duplicates
df_duplicates = df.copy()
df.drop_duplicates(subset='text', keep='first', inplace=True)

plt.figure()
ax = sns.countplot(df_duplicates, x='label', color='red', linewidth=0)
ax = sns.countplot(df, x='label', color=NAVY_DARK, linewidth=0)

plt.tight_layout()
plt.title('Duplicates per label')
plt.legend(['duplicates', 'unique'])
<matplotlib.legend.Legend at 0x1c1a2c93710>
df.duplicated('text').sum()
0

Step 2 - Balancing the classes

The balance between the labels looks somewhat uneven, especially for the first class, which stands out from the rest.
We downsample classes 0 and 2 to the size of the smallest class, which is 1.

By undersampling, we get rid of the imbalance problem at the cost of having less data for training the model.
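The same idea can also be written with plain pandas `groupby(...).sample(...)`; a minimal sketch on toy data (frame and variable names are hypothetical):

```python
import pandas as pd

# Toy frame with an imbalanced label distribution (6 / 2 / 4)
df_toy = pd.DataFrame({
    'text': [f'tweet {i}' for i in range(12)],
    'label': [0] * 6 + [1] * 2 + [2] * 4,
})

# Sample every class down to the size of the smallest one, without replacement
min_size = df_toy['label'].value_counts().min()
balanced = df_toy.groupby('label').sample(n=min_size, random_state=42)

print(sorted(balanced['label'].value_counts().to_dict().items()))  # [(0, 2), (1, 2), (2, 2)]
```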

df.value_counts('label')
label
0    10483
2     7167
1     6989
Name: count, dtype: int64
from sklearn.utils import resample

df_by_label = lambda i: df[df['label'] == i]
downsampled_df = pd.DataFrame(data=df_by_label(1))
max_size = len(df_by_label(1))

# resampling classes (without replacement, so no duplicates are reintroduced)
for i in [0, 2]:
    downsampled_class = resample(df_by_label(i),
                                 replace=False,
                                 n_samples=max_size,
                                 random_state=42)
    downsampled_df = pd.concat([downsampled_df, downsampled_class])

plt.figure()

ax = sns.countplot(df, x='label', color=NAVY_LIGHT)
ax = sns.countplot(downsampled_df, x='label', color=NAVY_DARK, linewidth=0)
ax.axhline(y=max_size, color='r', linestyle=':')
ax.set_title('Balancing classes')

plt.tight_layout()
plt.show()

Step 3 - Length of sentences

By grouping sentences by their length in words, we can identify and exclude length groups that occur fewer than 100 times in the dataset.
In practice, we remove sentences shorter than 3 words or longer than 55 words to get an even more balanced dataset.

What I found interestingly odd is the big jump in the second plot for the negative class at a length of around 46 words.
It seems that negative sentences tend to be longer, but this clue is too weak to be taken seriously.
More safely, we can say that the negative class contains more long sentences.
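The 3-55 bounds could also be derived from the data instead of hard-coded, by keeping only the length groups that clear the 100-occurrence threshold; a sketch on synthetic lengths (the Poisson distribution here is just a stand-in for the real histogram):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic sentence lengths; Poisson(20) roughly mimics a peaked length histogram
lengths = pd.Series(rng.poisson(20, size=5000), name='length')

counts = lengths.value_counts()
frequent = counts[counts >= 100].index   # length groups occurring 100+ times
trimmed = lengths[lengths.isin(frequent)]

# every surviving length group now clears the threshold
print(trimmed.value_counts().min() >= 100)  # True
```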

from matplotlib.lines import Line2D
from scipy.stats import hmean

# creating 3rd column that contains length in words of corresponding text data
df['length'] = df['text'].apply(lambda x: len(x.split()))
downsampled_df['length'] = downsampled_df['text'].apply(lambda x: len(x.split()))
df_by_label = lambda x: df[df['label'] == x]

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(20, 10))
sns.countplot(df, x='length', ax=axes[0])

colors = ['red', 'grey', 'green']
for i, color in enumerate(colors):
    count_by_length = df_by_label(i)['length'].value_counts().sort_index()
    sns.lineplot(x=count_by_length.index, y=count_by_length.values, marker=None, color=color, linewidth=3, ax=axes[1])

axes[0].set_title('Whole dataset')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Length (in words)')
axes[0].set_xlim(-1, 65)

axes[1].set_title('By class')
axes[1].set_ylabel('Count')
axes[1].set_xlabel('Length (in words)')
axes[1].set_xlim(1, 70)

legend_labels = ['negative', 'neutral', 'positive']
custom_lines = [Line2D([0], [0], color=colors[i], lw=3) for i in range(len(colors))]
axes[1].legend(custom_lines, legend_labels)

print(f"{'Mean':>20}: {int(df['length'].mean())}\n"
      f"{'Harmonic mean':>20}: {int(hmean(df['length']))}")

fig.suptitle('Amount of sentences per length', fontsize=20)
plt.tight_layout()
plt.show()
                Mean: 26
       Harmonic mean: 16
# cutting off the sentences; technically we don't need the length column for this
df_cut = df[(df['length'] >= 3) & (df['length'] <= 55)]
downsampled_df = downsampled_df[(downsampled_df['length'] >= 3) & (downsampled_df['length'] <= 55)]

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(30, 5))
sns.countplot(df, x='length', color=NAVY_LIGHT)
sns.countplot(df_cut, x='length', color=NAVY_DARK)

plt.axhline(y=100, color='red', linestyle=':')

print(f"{'Mean':>20}: {int(df['length'].mean())}\n"
      f"{'Harmonic mean':>20}: {int(hmean(df['length']))}")

fig.suptitle('Amount of sentences per length', fontsize=20)
axes.set_xlabel('Length (in words)')
plt.show()

df = df_cut
                Mean: 26
       Harmonic mean: 16

III - Creating the Model

Step 1 - Class weighting

First, we compute the class weights from the unbalanced dataset to avoid label favoritism.

weights = dict()
for i in range(3):
    # inverse class frequency, scaled by a constant factor of 6
    weights[i] = len(df) / len(df[df['label'] == i]) * 6
weights
{0: 14.108060917644776, 1: 21.275599765944996, 2: 20.49894291754757}
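The weights above are inverse class frequencies scaled by a constant. For comparison, scikit-learn's 'balanced' heuristic computes n_samples / (n_classes * n_class); a minimal sketch on toy labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class 0 is twice as frequent as classes 1 and 2
y = np.array([0] * 10 + [1] * 5 + [2] * 5)

w = compute_class_weight(class_weight='balanced', classes=np.array([0, 1, 2]), y=y)
weights_balanced = {c: round(float(x), 3) for c, x in zip([0, 1, 2], w)}
print(weights_balanced)  # {0: 0.667, 1: 1.333, 2: 1.333}
```

SGDClassifier also accepts `class_weight='balanced'` directly, which applies the same formula internally.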

Step 2 - Text tokenizing

To give our model a clearer representation of the text data, we tokenize it.
A custom transformer gives us extra options, such as demojizing emojis (turning them into words) and removing mentions ('@username').

from sklearn.base import BaseEstimator, TransformerMixin
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from string import punctuation
import emoji

class TextTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self, emoji=True, mentions=True):
        self.stemmer = SnowballStemmer('english')
        self.emoji = emoji
        self.mentions = mentions
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        processed_text = []
        
        for text in X:
            if self.emoji:
                text = emoji.demojize(text, delimiters=("", " "))
                text = text.replace("_", " ")
            if self.mentions:
                text = ' '.join(word for word in text.split() if word[0] != '@')
            text = ''.join(char.lower() for char in text if char not in punctuation)
            tokens = ' '.join(self.stemmer.stem(word) for word in text.split())
            processed_text.append(tokens)
            
        return processed_text

Testing functionality of our text tokenizer.

a = "😊😡   i don't care! #happy"
b = "smiling face emoji"
c = '@user23 thats #cRaZy!'

tokenizer = TextTokenizer()
for sentence in [a, b, c]:
    print(f"before: {sentence}\n after: {tokenizer.transform([sentence])}")
    print('')
before: 😊😡   i don't care! #happy
 after: ['smile face with smile eye enrag face i dont care happi']

before: smiling face emoji
 after: ['smile face emoji']

before: @user23 thats #cRaZy!
 after: ['that crazi']

Step 3 - Training the model

Splitting both the unbalanced and the downsampled dataset into train and test sets.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer

# TRAIN TEST SPLIT FOR DEFAULT DATA
x = df.iloc[:, 0]
y = df.iloc[:, 1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# TRAIN TEST SPLIT FOR DOWNSAMPLED DATA
x_2 = downsampled_df.iloc[:, 0]
y_2 = downsampled_df.iloc[:, 1]
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(x_2, y_2, test_size=0.2, random_state=42)

Now we create 4 variants of the same model:

  • sgd - Unbalanced (weighted)
  • sgd_2 - Unbalanced (weighted) + custom tokenizer
  • sgd_ds - Downsampled
  • sgd_ds_2 - Downsampled + custom tokenizer

The unbalanced variants train on the original dataset (with class weights), while the downsampled ones train on the balanced, downsampled dataset.

The pipeline below already reflects the choices settled during hyperparameter tuning.

sgd_ds_2 = Pipeline([
    ('tok', TextTokenizer()),
    ('vec', CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words('english'))),
    ('clf', SGDClassifier(random_state=42, n_jobs=-1, shuffle=True))])

# default - unbalanced data, no custom transformer
sgd = deepcopy(sgd_ds_2)
sgd.set_params(clf__class_weight=weights)
sgd.steps.pop(0)  # drop the 'tok' step

# default - unbalanced data + custom transformer
sgd_2 = deepcopy(sgd_ds_2)
sgd_2.set_params(clf__class_weight=weights)

# downsampled, no custom transformer
sgd_ds = deepcopy(sgd_ds_2)
sgd_ds.steps.pop(0)  # drop the 'tok' step

Just making sure we assigned the parameters correctly.

sgd, sgd_2, sgd_ds, sgd_ds_2
(Pipeline(steps=[('vec',
                  CountVectorizer(ngram_range=(1, 2),
                                  stop_words=['i', 'me', 'my', 'myself', 'we',
                                              'our', 'ours', 'ourselves', 'you',
                                              "you're", "you've", "you'll",
                                              "you'd", 'your', 'yours',
                                              'yourself', 'yourselves', 'he',
                                              'him', 'his', 'himself', 'she',
                                              "she's", 'her', 'hers', 'herself',
                                              'it', "it's", 'its', 'itself', ...])),
                 ('clf',
                  SGDClassifier(class_weight={0: 14.108060917644776,
                                              1: 21.275599765944996,
                                              2: 20.49894291754757},
                                n_jobs=-1, random_state=42))]),
 Pipeline(steps=[('tok', TextTokenizer()),
                 ('vec',
                  CountVectorizer(ngram_range=(1, 2),
                                  stop_words=['i', 'me', 'my', 'myself', 'we',
                                              'our', 'ours', 'ourselves', 'you',
                                              "you're", "you've", "you'll",
                                              "you'd", 'your', 'yours',
                                              'yourself', 'yourselves', 'he',
                                              'him', 'his', 'himself', 'she',
                                              "she's", 'her', 'hers', 'herself',
                                              'it', "it's", 'its', 'itself', ...])),
                 ('clf',
                  SGDClassifier(class_weight={0: 14.108060917644776,
                                              1: 21.275599765944996,
                                              2: 20.49894291754757},
                                n_jobs=-1, random_state=42))]),
 Pipeline(steps=[('vec',
                  CountVectorizer(ngram_range=(1, 2),
                                  stop_words=['i', 'me', 'my', 'myself', 'we',
                                              'our', 'ours', 'ourselves', 'you',
                                              "you're", "you've", "you'll",
                                              "you'd", 'your', 'yours',
                                              'yourself', 'yourselves', 'he',
                                              'him', 'his', 'himself', 'she',
                                              "she's", 'her', 'hers', 'herself',
                                              'it', "it's", 'its', 'itself', ...])),
                 ('clf', SGDClassifier(n_jobs=-1, random_state=42))]),
 Pipeline(steps=[('tok', TextTokenizer()),
                 ('vec',
                  CountVectorizer(ngram_range=(1, 2),
                                  stop_words=['i', 'me', 'my', 'myself', 'we',
                                              'our', 'ours', 'ourselves', 'you',
                                              "you're", "you've", "you'll",
                                              "you'd", 'your', 'yours',
                                              'yourself', 'yourselves', 'he',
                                              'him', 'his', 'himself', 'she',
                                              "she's", 'her', 'hers', 'herself',
                                              'it', "it's", 'its', 'itself', ...])),
                 ('clf', SGDClassifier(n_jobs=-1, random_state=42))]))
sgd.fit(x_train, y_train)
pred_default = sgd.predict(x_test)

sgd_2.fit(x_train, y_train)
pred_default_2 = sgd_2.predict(x_test)

sgd_ds.fit(x_train_2, y_train_2)
pred_ds = sgd_ds.predict(x_test_2)

sgd_ds_2.fit(x_train_2, y_train_2)
pred_ds_2 = sgd_ds_2.predict(x_test_2)
from sklearn.metrics import confusion_matrix, classification_report, f1_score

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))

mat_default = confusion_matrix(y_test, pred_default)
mat_default_2 = confusion_matrix(y_test, pred_default_2)
mat_downsampled = confusion_matrix(y_test_2, pred_ds)
mat_downsampled_2 = confusion_matrix(y_test_2, pred_ds_2)

for i, mat in enumerate([mat_default, mat_default_2, mat_downsampled, mat_downsampled_2]):
    sns.heatmap(mat.T, annot=True, fmt='d', square=True, cbar=False, cmap='Blues', ax=axes.flatten()[i])
    
axes.flatten()[0].set_title('Unbalanced data, weighted')
axes.flatten()[1].set_title('Unbalanced data, weighted, custom tokenizer')
axes.flatten()[2].set_title('Downsampled data')
axes.flatten()[3].set_title('Downsampled data, custom tokenizer')

for i in range(4):
    axes.flatten()[i].set_xlabel('true label')
    axes.flatten()[i].set_ylabel('predicted label')

print(f"Unbalanced data, weighted:\n{classification_report(y_test, pred_default)}\n\n")
print(f"Unbalanced data, weighted, custom tokenizer:\n{classification_report(y_test, pred_default_2)}\n\n")
print(f"Downsampled data:\n{classification_report(y_test_2, pred_ds)}\n\n")
print(f"Downsampled data, custom tokenizer:\n{classification_report(y_test_2, pred_ds_2)}")

plt.tight_layout()
plt.show()
Unbalanced data, weighted:
              precision    recall  f1-score   support

           0       0.81      0.82      0.81      2064
           1       0.64      0.67      0.66      1374
           2       0.79      0.74      0.76      1410

    accuracy                           0.75      4848
   macro avg       0.74      0.74      0.74      4848
weighted avg       0.75      0.75      0.75      4848



Unbalanced data, weighted, custom tokenizer:
              precision    recall  f1-score   support

           0       0.81      0.82      0.81      2064
           1       0.64      0.67      0.66      1374
           2       0.79      0.74      0.76      1410

    accuracy                           0.75      4848
   macro avg       0.74      0.74      0.74      4848
weighted avg       0.75      0.75      0.75      4848



Downsampled data:
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      1390
           1       0.76      0.78      0.77      1362
           2       0.86      0.85      0.86      1369

    accuracy                           0.83      4121
   macro avg       0.83      0.83      0.83      4121
weighted avg       0.83      0.83      0.83      4121



Downsampled data, custom tokenizer:
              precision    recall  f1-score   support

           0       0.87      0.86      0.87      1390
           1       0.77      0.81      0.79      1362
           2       0.88      0.84      0.86      1369

    accuracy                           0.84      4121
   macro avg       0.84      0.84      0.84      4121
weighted avg       0.84      0.84      0.84      4121

Step 4 - Hyperparameter tuning

Using the Grid Search Cross-Validation method, we test all parameter combinations and keep the one that performs best.
For this part I chose the model with balanced data and the custom tokenizer. It is quite important to find a good bias-variance trade-off so that the model also works in practice.

from sklearn.model_selection import GridSearchCV

parameters = {
    'tok__emoji': [True, False],
    'tok__mentions': [True, False],
    # 'vec__ngram_range': [(1,1), (1,2), (2,3)],
    # 'clf__loss': ['hinge', 'log_loss', 'modified_huber', 'perceptron', 'huber', 'epsilon_insensitive'],
    # 'clf__penalty': ['elasticnet', 'l1', 'l2', None],
    # 'clf__learning_rate': ['constant', 'optimal', 'adaptive', 'invscaling'],
    # 'clf__shuffle': [True, False],
    # 'clf__alpha': np.linspace(0, 10, 5),
    # 'clf__epsilon': np.linspace(0, 10, 5),
    # 'clf__eta0': np.linspace(0, 1, 5),
}
gs_clf = GridSearchCV(sgd_ds_2, parameters, n_jobs=-1, verbose=1)
gs_clf = gs_clf.fit(x_train_2, y_train_2)

print(gs_clf.best_score_)
print(gs_clf.best_params_)
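Beyond best_params_, the full grid is available in cv_results_; a minimal, fast-running sketch on toy data (the corpus and grid values are made up for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny separable toy corpus so the search finishes instantly
texts = ['good'] * 10 + ['bad'] * 10
labels = [1] * 10 + [0] * 10

pipe = Pipeline([('vec', CountVectorizer()),
                 ('clf', SGDClassifier(random_state=42))])
grid = {'clf__alpha': [1e-4, 1e-2]}  # hypothetical values

gs = GridSearchCV(pipe, grid, cv=2).fit(texts, labels)

# one row per parameter combination, ranked by mean CV score
report = (pd.DataFrame(gs.cv_results_)
          [['param_clf__alpha', 'mean_test_score', 'rank_test_score']]
          .sort_values('rank_test_score'))
print(len(report))  # 2
```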

We copy the previous cells to update our models with the new parameters and retrain them.

sgd_ds_2 = Pipeline([
    ('tok', TextTokenizer()),
    ('vec', CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words('english'))),
    ('clf', SGDClassifier(random_state=42, n_jobs=-1, penalty='elasticnet', shuffle=True,
                          loss='log_loss', learning_rate='adaptive', eta0=0.11, alpha=7e-05, epsilon=0.8)),
])

# default - unbalanced data, no custom transformer
sgd = deepcopy(sgd_ds_2)
sgd.set_params(clf__class_weight=weights)
sgd.steps.pop(0)  # drop the 'tok' step

# default - unbalanced data + custom transformer
sgd_2 = deepcopy(sgd_ds_2)
sgd_2.set_params(clf__class_weight=weights)

# downsampled, no custom transformer
sgd_ds = deepcopy(sgd_ds_2)
sgd_ds.steps.pop(0)  # drop the 'tok' step
sgd.fit(x_train, y_train)
pred_default = sgd.predict(x_test)

sgd_2.fit(x_train, y_train)
pred_default_2 = sgd_2.predict(x_test)

sgd_ds.fit(x_train_2, y_train_2)
pred_ds = sgd_ds.predict(x_test_2)

sgd_ds_2.fit(x_train_2, y_train_2)
pred_ds_2 = sgd_ds_2.predict(x_test_2)
from sklearn.metrics import confusion_matrix, classification_report, f1_score

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))

mat_default = confusion_matrix(y_test, pred_default)
mat_default_2 = confusion_matrix(y_test, pred_default_2)
mat_downsampled = confusion_matrix(y_test_2, pred_ds)
mat_downsampled_2 = confusion_matrix(y_test_2, pred_ds_2)

for i, mat in enumerate([mat_default, mat_default_2, mat_downsampled, mat_downsampled_2]):
    sns.heatmap(mat.T, annot=True, fmt='d', square=True, cbar=False, cmap='Blues', ax=axes.flatten()[i])

axes.flatten()[0].set_title('Unbalanced data, weighted')
axes.flatten()[1].set_title('Unbalanced data, weighted, custom tokenizer')
axes.flatten()[2].set_title('Downsampled data')
axes.flatten()[3].set_title('Downsampled data, custom tokenizer')

for i in range(4):
    axes.flatten()[i].set_xlabel('true label')
    axes.flatten()[i].set_ylabel('predicted label')

print(f"Unbalanced data, weighted:\n{classification_report(y_test, pred_default)}\n\n")
print(f"Unbalanced data, weighted, custom tokenizer:\n{classification_report(y_test, pred_default_2)}\n\n")
print(f"Downsampled data:\n{classification_report(y_test_2, pred_ds)}\n\n")
print(f"Downsampled data, custom tokenizer:\n{classification_report(y_test_2, pred_ds_2)}")

plt.tight_layout()
plt.show()
Unbalanced data, weighted:
              precision    recall  f1-score   support

           0       0.84      0.84      0.84      2064
           1       0.67      0.69      0.68      1374
           2       0.79      0.76      0.77      1410

    accuracy                           0.77      4848
   macro avg       0.76      0.76      0.76      4848
weighted avg       0.77      0.77      0.77      4848



Unbalanced data, weighted, custom tokenizer:
              precision    recall  f1-score   support

           0       0.84      0.84      0.84      2064
           1       0.67      0.69      0.68      1374
           2       0.79      0.76      0.77      1410

    accuracy                           0.77      4848
   macro avg       0.76      0.76      0.76      4848
weighted avg       0.77      0.77      0.77      4848



Downsampled data:
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      1390
           1       0.77      0.77      0.77      1362
           2       0.86      0.86      0.86      1369

    accuracy                           0.83      4121
   macro avg       0.83      0.83      0.83      4121
weighted avg       0.83      0.83      0.83      4121



Downsampled data, custom tokenizer:
              precision    recall  f1-score   support

           0       0.87      0.85      0.86      1390
           1       0.76      0.80      0.78      1362
           2       0.87      0.84      0.86      1369

    accuracy                           0.83      4121
   macro avg       0.83      0.83      0.83      4121
weighted avg       0.84      0.83      0.83      4121

Step 5 - Testing in practice

For the tests, I prepared 3 lists corresponding to the three sentiments;
each list contains 10 sentences of ~5 words and 10 sentences of ~10 words.

negative_sentences = [
    # ~5 Words
    "I hate this so much. 😡",
    "Worst experience ever. 👎 #disappointed",
    "This is so frustrating. 😠",
    "Totally not worth it. 💸",
    "@user213 You ruined everything. 😤",
    "I can't stand this. 😒",
    "This is pure garbage. 🚮",
    "Unbelievably bad service. 😤 #neveragain",
    "What a waste of time. 🕒",
    "Seriously the worst ever. ❌",

    # ~10 Words
    "I can't believe how awful this turned out to be. 😡 #fail",
    "@user213 You really let me down, very disappointed in you.",
    "This is the worst product I've ever purchased. Refund, please! 💸",
    "Completely ruined my day, thanks for nothing. 😠 #neveragain",
    "Everything about this is just terrible, never using it again. ❌",
    "@user213 Your customer support is useless and unhelpful, so frustrating! 😤",
    "Honestly, I expected much better from you, this is trash. 🚮",
    "Can't believe I wasted money on this, so regretful. 💸",
    "I am so upset right now, what a huge letdown. 😡",
    "Honestly, this entire experience has been nothing but a headache. 😠"
]

neutral_sentences = [
    # ~5 Words
    "It was okay, nothing special. 🤷‍♂️",
    "Not good, not bad either.",
    "Just an average experience today.",
    "Meh, it's alright I guess. 😐",
    "@user213 Could be better, honestly.",
    "This is neither here nor there.",
    "Just a regular day. 🤔",
    "I feel indifferent about it.",
    "Nothing to complain about. 🤷‍♀️",
    "It's fine, I suppose. 😶",

    # ~10 Words
    "I guess it's just fine, nothing really stood out. 🤷‍♂️",
    "Not amazing, but not terrible either, just kind of average.",
    "@user213 It's okay, not sure how I feel about it.",
    "This was pretty much what I expected, nothing surprising here.",
    "Honestly, I'm neither impressed nor disappointed, just neutral. 😐",
    "I don't really have a strong opinion on this one.",
    "It's fine, nothing to rave about or criticize. 🤷‍♀️",
    "Neither satisfied nor dissatisfied, just another average experience. 🤔",
    "@user213 It was pretty standard, nothing particularly great or bad.",
    "I'd call it a very average experience, to be honest. 😶"
]

positive_sentences = [
    # ~5 Words
    "Absolutely loved it! 😍 #amazing",
    "This made my day! 😊",
    "Fantastic job, @user213! 🌟",
    "I'm so happy! 🥳 #blessed",
    "Worth every penny! 💰",
    "Super excited about this! 😃",
    "Best decision ever made. 🙌",
    "Love this so much! 💖",
    "So proud of you, @user213!",
    "Can't stop smiling! 😊",

    # ~10 Words
    "I'm incredibly happy with this, exceeded all my expectations! 😍",
    "Thank you, @user213, for such a fantastic experience! 🌟 #grateful",
    "This product has genuinely improved my life, super grateful! 💖",
    "I am over the moon with how this turned out. 😊",
    "Wow, just wow! Couldn't have asked for anything better. 🙌",
    "Amazing experience from start to finish, highly recommend! 🌟 #bestdayever",
    "I'm so glad I tried this, totally worth it. 😊",
    "You nailed it, @user213! I'm really impressed! 👏",
    "I couldn't be happier with the results, totally satisfied. 😃",
    "This exceeded my expectations, truly a delightful surprise! 🌟"
]

score_df = {
    'Model': ['negative', 'neutral', 'positive', 'average'],  # row order matches the loop below
    'Default': [0, 0, 0, 0],
    'Default + Custom tokenizer': [0, 0, 0, 0],
    'Downsampled': [0, 0, 0, 0],
    'Downsampled + Custom tokenizer': [0, 0, 0, 0]
}

score_df = pd.DataFrame(score_df).set_index('Model')

for i_2, model in enumerate([sgd, sgd_2, sgd_ds, sgd_ds_2]):
    avg_score = 0
    for i_1, sentiment in enumerate([negative_sentences, neutral_sentences, positive_sentences]):
        predictions = model.predict(sentiment)
        score = (predictions == i_1).mean() * 100  # % of the 20 sentences classified correctly
        score_df.iloc[i_1, i_2] = score
        avg_score += score

    avg_score = round(avg_score / 3, 1)
    score_df.iloc[3, i_2] = avg_score
score_df
Default Default + Custom tokenizer Downsampled Downsampled + Custom tokenizer
Model
negative 80.0 100.0 75 95
neutral 45.0 50.0 40 65
positive 80.0 95.0 80 95
average 68.3 81.7 65 85
plt.figure(figsize=(10, 6))
ax = sns.barplot(data=score_df.iloc[3], palette=sns.color_palette('Blues', 4))
ax.set_title('Score vs Model', fontweight='bold')
ax.set_ylim(0, 100)
plt.show()

IV - Conclusion

I'm very satisfied with the practical test results; the custom tokenizer works noticeably better in practice and clearly had an impact.
I think that, after more tweaking, it has the potential to be used on social media platforms such as Instagram, Twitter or YouTube.

Thank you for reading this project and I hope you enjoyed the whole process :]

@ Gracjan Pawłowski 2024